20 research outputs found

    User Interaction Models for Disambiguation in Programming by Example

    Programming by Examples (PBE) has the potential to revolutionize end-user programming by enabling end users, most of whom are non-programmers, to create small scripts for automating repetitive tasks. However, examples, though often easy to provide, are an ambiguous specification of the user's intent. As a result, a key impediment to the adoption of PBE systems is the lack of user confidence in the correctness of the synthesized program. We present two novel user interaction models that communicate actionable information to the user to help resolve ambiguity in the examples. One model lets the user navigate the huge set of programs that are consistent with the provided examples. The other uses active learning to ask the user directed, example-based questions about the test input data on which the user intends to run the synthesized program. Our user studies show that each model significantly reduces the number of errors in the performed task without any difference in completion time. Moreover, both models are perceived as useful, and the proactive, active-learning-based model is slightly preferred with respect to users' confidence in the result.
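The active-learning idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: the candidate programs, the helper names `consistent` and `distinguishing_input`, and the sample data are all hypothetical, and a real PBE system would search a far larger program space. The sketch only shows the core step of finding a test input on which the surviving programs disagree, which is the input worth asking the user about.

```python
from typing import Callable, Optional, Sequence

Program = Callable[[str], str]

def consistent(programs: Sequence[Program],
               examples: Sequence[tuple[str, str]]) -> list[Program]:
    """Keep only the programs that reproduce every user-provided example."""
    return [p for p in programs if all(p(x) == y for x, y in examples)]

def distinguishing_input(programs: Sequence[Program],
                         inputs: Sequence[str]) -> Optional[str]:
    """Return the first test input on which the programs disagree, if any."""
    for x in inputs:
        if len({p(x) for p in programs}) > 1:
            return x
    return None

# Hypothetical candidate programs for "extract something from a name".
candidates: list[Program] = [
    lambda s: s.split()[0],  # first word
    lambda s: s[:3],         # first three characters
    lambda s: s.upper(),     # upper-case the whole string
]

examples = [("Ann Lee", "Ann")]          # one example from the user
test_inputs = ["Bob Ray", "Maria Diaz"]  # data the program will run on

survivors = consistent(candidates, examples)
question = distinguishing_input(survivors, test_inputs)
```

Here two candidates survive the single example, and they first disagree on "Maria Diaz", so that is the input the system would ask the user to label.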

    Computational Curation of Open Science Data

    Thesis (Ph.D.)--University of Washington, 2018. Rapid advances in data collection, storage and processing technologies are driving a new, data-driven paradigm in science. In the life sciences, progress is driven by plummeting genome sequencing costs, opening up new fields of bioinformatics, genomics, and systems biology. The return on the enormous investments into the collection and storage of the data is hindered by a lack of curation, leaving a significant portion of the data stagnant and underused. In this dissertation, we introduce several approaches aimed at making open scientific data accessible, valuable, and reusable. First, in the Wide-Open project, we introduce a text mining system for detecting datasets that are referenced in published papers but are still kept private. After parsing over 1.5 million open access publications, Wide-Open identified hundreds of datasets overdue for publication, 400 of which were then released within one week. Second, we propose a machine learning system, EZLearn, for annotating scientific data into potentially thousands of classes without the manual work required to provide training labels. EZLearn is based on the observation that in scientific domains, data samples often come with natural language descriptions meant for human consumption. We take advantage of those descriptions by introducing an auxiliary natural language processing system and training it together with the main classifier in a co-training fashion. Third, we introduce Cedalion, a system that can capture scientific claims from papers, validate them against the data associated with the paper, then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence. We evaluated Cedalion by applying it to gene expression datasets and producing reports summarizing the evidence for or against the claim based on the entirety of the collected knowledge in the repository. We find that the claim-based algorithms we propose outperform conventional data integration methods and achieve high accuracy against manually validated claims.

    Wide Open Science

    A paper for the “Imagining Tomorrow’s University: Rethinking scholarship, education, and institutions for an open, networked era” workshop.

    Pathway Graphical Lasso

    Graphical models provide a rich framework for summarizing the dependencies among variables. The graphical lasso approach attempts to learn the structure of a Gaussian graphical model (GGM) by maximizing the log likelihood of the data, subject to an l1 penalty on the elements of the inverse covariance matrix. Most algorithms for solving the graphical lasso problem do not scale to a very large number of variables. Furthermore, the learned network structure is hard to interpret. To overcome these challenges, we propose a novel GGM structure learning method that exploits the fact that for many real-world problems we have prior knowledge that certain edges are unlikely to be present. For example, in gene regulatory networks, a pair of genes that does not participate together in any of the cellular processes, typically referred to as pathways, is less likely to be connected. In computer vision applications in which each variable corresponds to a pixel, each variable is likely to be connected to the nearby variables. In this paper, we propose the pathway graphical lasso, which learns the structure of a GGM subject to pathway-based constraints. To solve this problem, we decompose the network into smaller parts and use a message-passing algorithm to communicate among the subnetworks. Our algorithm achieves orders-of-magnitude improvement in run time compared to state-of-the-art optimization methods for the graphical lasso problem that were modified to handle pathway-based constraints.
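For reference, the optimization problem described in the abstract above can be written out as follows. This is a sketch under stated assumptions, not a formula quoted from the paper: $S$ denotes the empirical covariance matrix, $\Theta$ the inverse covariance (precision) matrix, $\lambda$ the penalty weight, and $\mathcal{P}$ a hypothetical symbol for the set of variable pairs that share at least one pathway. Whether the penalty includes the diagonal entries varies between formulations.

```latex
% Standard graphical lasso: penalized Gaussian log-likelihood
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0}\;
  \log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda\,\lVert\Theta\rVert_{1}

% One way to express pathway-based constraints (assumed form):
% entries for pairs outside every pathway are fixed to zero
\text{subject to}\quad \Theta_{ij} = 0 \quad \text{for all } i \neq j \text{ with } (i,j) \notin \mathcal{P}
```

Under this form, the zero constraints are what make it possible to split the problem into smaller, overlapping subproblems, one per pathway.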

    Wide-Open: Accelerating public data release by automating detection of overdue datasets

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
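The two steps described in the abstract above, finding dataset references in article text and checking whether each dataset is still private, can be sketched as follows. This is an illustration only, not the Wide-Open implementation: the accession patterns cover just GEO series (`GSE` plus digits) and SRA studies (`SRP` plus digits), and `is_public` is a hypothetical stand-in for a real repository query.

```python
import re
from typing import Callable, Iterable

# Assumed accession patterns: GEO series (GSE + digits), SRA studies (SRP + digits).
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")

def find_accessions(text: str) -> list[str]:
    """Return distinct accession-like tokens in order of first appearance."""
    seen: dict[str, None] = {}
    for match in ACCESSION_RE.finditer(text):
        seen.setdefault(match.group(1), None)
    return list(seen)

def overdue(accessions: Iterable[str],
            is_public: Callable[[str], bool]) -> list[str]:
    """Keep accessions that the (hypothetical) lookup reports as still private."""
    return [acc for acc in accessions if not is_public(acc)]

# Made-up article text and accession numbers, for illustration only.
article = ("Data were deposited in GEO under GSE12345 and in SRA under "
           "SRP067890 (see also GSE12345).")
found = find_accessions(article)

# Stand-in for a repository query: here only GSE12345 is treated as public.
still_private = overdue(found, is_public=lambda acc: acc == "GSE12345")
```

In a real pipeline, `is_public` would query the repository and interpret its response; that part is deliberately left abstract here.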

    Number of samples in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO).

    Data underlying the figure are available as S1 Data (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s001).

    Number of Gene Expression Omnibus (GEO) datasets overdue for release over time, as detected by Wide-Open.

    Prior to this submission, we notified GEO of the standing list, which led to the dramatic drop of overdue datasets (magenta portion), with 400 datasets released within the first week. Data underlying the figure are available as S2 Data (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s002).

    Average delay from submission to release in the Gene Expression Omnibus (GEO).

    Data underlying the figure are available as S3 Data (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.2002477#pbio.2002477.s003).

    Stress-strain state analysis of the leading car body of DPKr-2 diesel train under action of design and operational loads

    Purpose. To ensure the strength and durability of the main structural element of the DPKr-2 diesel train, the leading car body. Methodology. A 3-D solid model of the body was built, and durability calculations were carried out for the loads stipulated by the regulatory documents in force in Ukraine. In particular, the following main design modes were considered: mode 1, a notional safety mode that takes into account the possibility of considerable longitudinal forces arising during shunting movements, transportation and accidental collision; and mode 2, an operational mode that takes into account the forces acting on a train during acceleration to design speed, coasting, or braking from this speed while passing a curve. Results. Based on the results of theoretical and experimental studies, it was concluded that the leading car body of the DPKr-2 diesel train meets the requirements of regulatory documents regarding strength and durability. Practical relevance. The combined computational and experimental assessment of the stress-strain state of the leading car body of the DPKr-2 diesel train under design and operational loads made it possible to create a structure that meets not only operational requirements but also strength and durability requirements.

    Identifying Network Perturbation in Cancer

    We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed, that is, has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about the accuracy of the specific edges inferred in a particular network. In synthetic data, DISCERN identifies perturbed genes more accurately than existing methods for identifying perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.